Contextual biasing for speech recognition

ABSTRACT

A method of biasing speech recognition includes receiving audio data encoding an utterance and obtaining a set of one or more biasing phrases corresponding to a context of the utterance. Each biasing phrase in the set of one or more biasing phrases includes one or more words. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data and grapheme and phoneme data derived from the set of one or more biasing phrases to generate an output of the speech recognition model. The method also includes determining a transcription for the utterance based on the output of the speech recognition model.

CROSS REFERENCE TO RELATED APPLICATIONS

This U.S. patent application is a continuation of, and claims priority under 35 U.S.C. § 120 from, U.S. patent application Ser. No. 16/863,766, filed on Apr. 30, 2020, which claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application 62/863,308, filed on Jun. 19, 2019. The disclosures of these prior applications are considered part of the disclosure of this application and are hereby incorporated by reference in their entireties.

TECHNICAL FIELD

This disclosure relates to contextual biasing for speech recognition.

BACKGROUND

Recognizing the context of speech is the goal of automatic speech recognition (ASR) systems. The ability to recognize context in speech, however, is challenging given the wide variety of words that people may speak and the many variations in accents and pronunciation. In many cases, the types of words and phrases that a person speaks vary depending on the context the person finds himself or herself in.

Contextual automated speech recognition (ASR) involves biasing speech recognition towards a given context, such as towards a user's own playlist, contacts, or geographic place names. Context information usually includes a list of relevant phrases to be recognized, which often includes rare phrases or even foreign words which are seen infrequently in training. To perform contextual biasing, conventional ASR systems sometimes model contextual information in an independent contextual language model (LM), using an n-gram weighted finite state transducer (WFST), and compose the independent contextual LM with a baseline LM for on-the-fly (OTF) rescoring.

Recently, end-to-end (E2E) models have shown great promise for ASR, exhibiting improved word error rates (WERs) and latency metrics as compared to conventional on-device models. These E2E models, which fold the acoustic model (AM), pronunciation model (PM), and LM into a single network to directly learn speech-to-text mapping, have shown competitive results compared to conventional ASR systems which have a separate AM, PM, and LM. Representative E2E models include word-based connectionist temporal classification (CTC) models, recurrent neural network transducer (RNN-T) models, and attention-based models such as Listen, Attend, and Spell (LAS). Because E2E models maintain a limited number of recognition candidates during beam-search decoding, contextual ASR can be challenging for E2E models.

SUMMARY

Implementations herein are directed toward biasing a speech recognition model toward a set of phrases relevant to a current context. The set of phrases is dynamic in that the phrases may change as the context changes. The speech recognition model may further take into account the pronunciations of the phrases used for biasing in addition to written representations of the phrases. For example, the model can use text information (e.g., grapheme sequences) and pronunciation information (e.g., phoneme sequences) representing biasing phrases to select which phrases are most applicable to speech being recognized. This allows the contextual biasing aspect of the speech model to be aware of the pronunciations of individual biasing phrases, thereby enabling the model to achieve much higher accuracy when recognizing rare words and words with unusual pronunciations. The contextual biasing techniques discussed herein can be used with various types of speech recognition models, including end-to-end models that can generate transcription data without a separate acoustic model, pronunciation model, and language model.

To provide high speech recognition accuracy, general-purpose automatic speech recognition (ASR) systems may use information indicating the context in which speech occurs. The context may be related to, for example, the user's contacts, calendar appointments, open apps, and location. One example of an ASR system contains separate acoustic, pronunciation, and language models. Other ASR systems combine the acoustic, pronunciation, and language models as a single neural network. A single neural network model improves simplicity and quality, and optimizes word error rate (WER).

For conventional neural network-based ASR models, it is challenging to recognize words that infrequently appear in a language. The distribution of words in a language typically follows a Zipfian distribution, where a small number of words are used very frequently, and vast numbers of words are rarely used. It can be difficult to obtain a large enough set of training data with audio and corresponding text to effectively provide examples of rare words. Even if sufficient training data is available, adding more and more training examples often yields improvements of lower and lower magnitude. Additionally, it is challenging for these models to recognize words with unusual pronunciations relative to their spelling.

Incorporating contextual biasing into a neural network-based ASR model can improve recognition for rare words and words with unusual pronunciations. One useful application is to better recognize proper names (i.e., proper nouns such as people names, song names, city names, etc.), which may be relatively rare in training data and/or may not follow typical pronunciation rules. For example, a smart phone or other user device often stores a user's contacts. When a user is using a messaging or phone calling application, this information can be used as context to help the ASR system recognize names spoken by a user. In a similar manner, a list of songs in a music library of a user can be used to bias speech recognition, for example, when the user is using a media player application. Implementations herein are directed toward applying contextual biasing to an ASR model by injecting information associated with both written forms and pronunciations of biasing phrases into the ASR model. Contextually biasing the ASR model incorporates knowledge of rare word pronunciations even if the words were not observed during training of the ASR model.

One aspect of the disclosure provides a method for biasing speech recognition that includes receiving, at data processing hardware, audio data encoding an utterance, and obtaining, by the data processing hardware, a set of one or more biasing phrases corresponding to a context of the utterance, where each biasing phrase in the set of one or more biasing phrases includes one or more words. The method also includes processing, by the data processing hardware, using a speech recognition model, acoustic features derived from the audio data and grapheme and phoneme data derived from the set of one or more biasing phrases to generate an output of the speech recognition model. The method also includes determining, by the data processing hardware, a transcription for the utterance based on the output of the speech recognition model.

Implementations of the disclosure may include one or more of the following optional features. In some implementations, the speech recognition model includes a first encoder, a first attention module, a grapheme encoder, a phoneme encoder, a second attention module, and a decoder. The first encoder is configured to receive, as input, the acoustic features and generate, as output, audio encodings from the acoustic features. The first attention module is configured to receive, as input, the audio encodings output from the first encoder and generate, as output, first attention outputs. The grapheme encoder is configured to receive, as input, grapheme data indicating graphemes of each word in the set of one or more biasing phrases and generate, as output, grapheme encodings. The phoneme encoder is configured to receive, as input, phoneme data indicating phonemes of each word in the set of one or more biasing phrases and generate, as output, phoneme encodings. The second attention module is configured to receive, as input, a representation of the grapheme encodings output from the grapheme encoder and the phoneme encodings output from the phoneme encoder and generate, as output, second attention outputs. The decoder is configured to determine likelihoods of sequences of speech elements based on the first attention outputs and the second attention outputs.

In some examples, for each particular word of each biasing phrase in the set of one or more biasing phrases: the grapheme encoder is configured to generate a corresponding grapheme encoding for the particular word; the phoneme encoder is configured to generate a corresponding phoneme encoding for the particular word; and the second attention module is configured to encode a corresponding second attention output that comprises a corresponding contextual biasing vector for the particular word based on the corresponding grapheme and phoneme encodings for the particular word. In additional examples, the representation of the grapheme encodings output from the grapheme encoder and the phoneme encodings output from the phoneme encoder includes a projection vector representing a concatenation between the grapheme encodings and the phoneme encodings. Additionally, the first encoder, the first attention module, the grapheme encoder, the phoneme encoder, the second attention module, and the decoder may be trained jointly to predict a sequence of graphemes from a sequence of acoustic feature frames.

In some implementations, the first attention module is configured to compute attention as a function of a previous hidden state of the decoder and a full sequence of audio encodings output by the first encoder, while the second attention module is configured to compute attention as a function of the previous hidden states of the decoder and a full sequence of projection vectors representing characteristics of both the grapheme and phoneme data derived from the biasing phrases in the set of one or more biasing phrases. In additional implementations, the second attention module is configured to receive a contextual biasing vector that does not correspond to any of the biasing phrases in the set of one or more biasing phrases, the contextual biasing vector representing an option to not bias the output of the speech recognition model. In some examples, the speech elements include graphemes. In other examples, the speech elements include words or wordpieces. Optionally, the speech elements may include phonemes.

The set of one or more biasing phrases may include one or more contact names personalized for a particular user. Additionally or alternatively, the set of one or more biasing phrases may include one or more calendar events personalized for the particular user. In some examples, the method also includes determining, by the data processing hardware, the context of the utterance based on at least one of: a location of a user that spoke the utterance; one or more applications open on a user device associated with the user that spoke the utterance; or a current date and/or time of the utterance. The context of the utterance may additionally or alternatively be based on one or more previous commands issued by the user.

In some implementations, the speech recognition model includes a decoder configured to determine a hidden state and the output of the speech recognition model based on an embedding vector, a previous hidden state of the decoder, a first vector, and a second vector. Here, the embedding vector is for a previous grapheme output by the speech recognition model, while the first vector is output by a first attention module and the second vector is output by a second attention module.

Another aspect of the disclosure provides a system for biasing speech recognition based on a context of an utterance. The system includes data processing hardware and memory hardware in communication with the data processing hardware and storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations. The operations include receiving audio data encoding an utterance and obtaining a set of one or more biasing phrases corresponding to a context of the utterance, where each biasing phrase in the set of one or more biasing phrases includes one or more words. The operations also include processing, using a speech recognition model, acoustic features derived from the audio data and grapheme and phoneme data derived from the set of one or more biasing phrases to generate an output of the speech recognition model. The operations also include determining a transcription for the utterance based on the output of the speech recognition model.

This aspect may include one or more of the following optional features. In some implementations, the speech recognition model includes a first encoder, a first attention module, a grapheme encoder, a phoneme encoder, a second attention module, and a decoder. The first encoder is configured to receive, as input, the acoustic features and generate, as output, audio encodings from the acoustic features. The first attention module is configured to receive, as input, the audio encodings output from the first encoder and generate, as output, first attention outputs. The grapheme encoder is configured to receive, as input, grapheme data indicating graphemes of each word in the set of one or more biasing phrases and generate, as output, grapheme encodings. The phoneme encoder is configured to receive, as input, phoneme data indicating phonemes of each word in the set of one or more biasing phrases and generate, as output, phoneme encodings. The second attention module is configured to receive, as input, a representation of the grapheme encodings output from the grapheme encoder and the phoneme encodings output from the phoneme encoder and generate, as output, second attention outputs. The decoder is configured to determine likelihoods of sequences of speech elements based on the first attention outputs and the second attention outputs.

In some examples, for each particular word of each biasing phrase in the set of one or more biasing phrases: the grapheme encoder is configured to generate a corresponding grapheme encoding for the particular word; the phoneme encoder is configured to generate a corresponding phoneme encoding for the particular word; and the second attention module is configured to encode a corresponding second attention output that comprises a corresponding contextual biasing vector for the particular word based on the corresponding grapheme and phoneme encodings for the particular word. In additional examples, the representation of the grapheme encodings output from the grapheme encoder and the phoneme encodings output from the phoneme encoder includes a projection vector representing a concatenation between the grapheme encodings and the phoneme encodings. Additionally, the first encoder, the first attention module, the grapheme encoder, the phoneme encoder, the second attention module, and the decoder may be trained jointly to predict a sequence of graphemes from a sequence of acoustic feature frames.

In some implementations, the first attention module is configured to compute attention as a function of a previous hidden state of the decoder and a full sequence of audio encodings output by the first encoder, while the second attention module is configured to compute attention as a function of the previous hidden states of the decoder and a full sequence of projection vectors representing characteristics of both the grapheme and phoneme data derived from the biasing phrases in the set of one or more biasing phrases. In additional implementations, the second attention module is configured to receive a contextual biasing vector that does not correspond to any of the biasing phrases in the set of one or more biasing phrases, the contextual biasing vector representing an option to not bias the output of the speech recognition model. In some examples, the speech elements include graphemes. In other examples, the speech elements include words or wordpieces. Optionally, the speech elements may include phonemes.

The set of one or more biasing phrases may include one or more contact names personalized for a particular user. Additionally or alternatively, the set of one or more biasing phrases may include one or more calendar events personalized for the particular user. In some examples, the operations also include determining the context of the utterance based on at least one of: a location of a user that spoke the utterance; one or more applications open on a user device associated with the user that spoke the utterance; or a current date and/or time of the utterance. The context of the utterance may additionally or alternatively be based on one or more previous commands issued by the user.

In some implementations, the speech recognition model includes a decoder configured to determine a hidden state and the output of the speech recognition model based on an embedding vector, a previous hidden state of the decoder, a first vector, and a second vector. Here, the embedding vector is for a previous grapheme output by the speech recognition model, while the first vector is output by a first attention module and the second vector is output by a second attention module.

The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic view of an example automatic speech recognition system including a speech recognition model that biases speech recognition results based on a context relevant to an utterance.

FIG. 2 is a schematic view of an example selection process for selecting biasing phrases.

FIG. 3 is a schematic view of an example architecture of the speech recognition model of FIG. 1.

FIG. 4 is a schematic view of an example training process for training the speech recognition model of FIG. 1.

FIG. 5 is a flowchart of an example arrangement of operations for a method of contextually biasing a speech recognition model.

FIG. 6 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.

Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

Conventional automated speech recognition (ASR) systems rely on three models: an acoustic model, a pronunciation model, and a language model. Design decisions for the pronunciation model influence both the acoustic model and the language model. For the pronunciation model, a set of phonemes is chosen to represent all of the possible distinct units of sound, thus determining the set of labels the acoustic model has to predict. All of the words in the language model have a pronunciation.

An accurate pronunciation model must have large coverage. For a given variant of a language, the phoneme set varies from region to region. For example, in American English, native speakers do not agree whether words like “pen” and “pin” are pronounced the same way. Additionally, certain words can have reduced pronunciations, where speakers may or may not pronounce the letter “t” in words like “twenty.” For extremely common words like “the,” it is unnatural to hard-code pronunciations.

End-to-end (E2E) speech recognition models combine the acoustic, pronunciation, and language models into a single neural network. A single neural network model improves simplicity and quality, and optimizes word error rate (WER). However, a challenge in E2E speech recognition models is optimizing performance on recognizing words that appear infrequently in a language and/or have unusual pronunciations relative to their spelling. While training data can include both human-transcribed voice data and text-only data, the use of large training data sets for training these E2E speech recognition models is inefficient. Because the distribution of words in a language typically follows a Zipfian distribution, where a small number of words are used very frequently, and vast numbers of words are rarely used, increasing the number of training examples typically yields improvements of lower and lower magnitude.

Incorporating contextual biasing into a neural network ASR model can improve recognition for rare words and words with unusual pronunciations. For example, since a user's contacts are often stored on a smart phone, they can be used as context to help the ASR system recognize the names of contacts spoken by a user. Contextual biasing can be applied to ASR models by injecting both biasing context and pronunciation into the model. The contextually biased model retains the advantages of neural network models, including simple, unified training and implicit learning of the pronunciation of rare words. The contextually biased model incorporates knowledge of rare word pronunciation even if the words have never been present during training.

Referring to FIG. 1, an example ASR system 100 includes a speech recognition model 300 that incorporates contextual biasing to improve speech recognition accuracy. The speech recognition model 300 includes an audio encoder 110, a grapheme encoder 122, a phoneme encoder 126, an audio attention module 132, a contextual biasing attention module 134, and a decoder 142 that receives, as input, outputs from both of the audio and contextual biasing attention modules 132, 134. In some examples, an output of the decoder 142 is normalized with a softmax layer 144, whereby the softmax layer 144 outputs a probability distribution over sequences of output graphemes. The softmax layer 144 may provide a probability distribution over a set of language units (e.g., speech elements), such as a set of graphemes. For example, the output of the softmax layer 144 can be a vector having a probability value for each of the graphemes in a language, as well as potentially for other symbols (e.g., punctuation, space, etc.). The sequence of vectors from the softmax layer 144 is used to produce a transcription 146 of a speech utterance 104 recorded by a user device 106. In some examples, the speech recognition model 300 resides on a user device 106 associated with a user 102. In other examples, the speech recognition model 300 resides on a remote server in communication with the user device 106, or functionality of the speech recognition model 300 is split among the remote server and the user device 106.

In the example shown, the user 102 speaks an utterance 104 captured by one or more microphones of the user device 106. The user device 106 may include a mobile device, such as a smart phone, tablet, smart headphones, smart watch, etc. The utterance 104 spoken by the user 102 may be a command, “Navigate to Bexar Courthouse”, in which “Bexar” is a name of a county pronounced /ber/, similar to the words “bear” and “bare”. The one or more microphones of the user device 106 generate an audio signal 108 from the spoken command. The audio signal (interchangeably referred to as “audio data”) encodes the utterance 104 and can be processed in any of various ways to determine inputs for the neural network elements of the speech recognition model 300. For example, a feature extraction module (not shown) can generate acoustic features 109 derived from the audio signal 108, such as Mel-frequency cepstral components for different windows or frames of the audio signal 108.
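As an illustration of the kind of front-end processing such a feature extraction module might perform, the sketch below computes log-mel features with the open-source librosa library. The 80-bin, 25 ms/10 ms configuration mirrors the example training setup described later in this disclosure; the function name and library choice are illustrative assumptions, not part of the disclosed system.

```python
# Hypothetical front-end sketch: compute log-mel features (80 bins,
# 25 ms windows every 10 ms) from an utterance waveform.
import librosa
import numpy as np

def extract_log_mel(wav_path: str, sr: int = 16000) -> np.ndarray:
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms frame step
        n_mels=80)
    log_mel = librosa.power_to_db(mel)
    return log_mel.T  # shape: (num_frames, 80), one row per frame
```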

The user device 106 also provides context information 114 to the speech recognition model 300. Context information 114 can include, for example, a user's location, the apps that are open on the user device 106, contacts of the user, and upcoming calendar appointments. In this case, context information 114 may include the user's current GPS location within Bexar County, a navigation application 107 open on the user device 106, and the user's upcoming calendar appointment at the courthouse. The calendar appointment could be obtained by accessing a calendar application 107 executable on the user device 106. The user 102 may explicitly opt in to consent to sharing each type of context information 114 individually, and may opt out at any time to revoke the previously consented-to sharing of one or more types of context information 114.

The speech recognition model 300 uses the context information 114 received from the user device 106 to select particular biasing phrases 116, each of which may include both grapheme data 118 and phoneme data 120. The grapheme data 118 is input to the grapheme encoder 122, while the phoneme data 120 is input to the phoneme encoder 126. Additional detail regarding the selection of biasing phrases 116 is provided in FIG. 2.

In some implementations, the audio, grapheme, and phoneme encoders 110, 122, 126 include sequence models having long short-term memory (LSTM) elements or other recurrent neural network (RNN) elements. The audio encoder 110 may include a first encoder configured to generate an encoded audio vector (e.g., audio encodings) 112 based on the audio signal 108. For instance, the audio encoder 110 may receive acoustic features 109 derived from the audio signal 108 encoding the utterance 104 and generate the encoded audio vector 112 from the acoustic features 109. The encoded audio vector 112 is input to the audio attention module 132. The grapheme encoder 122 is configured to generate an encoded grapheme vector 124 from grapheme data 118 associated with the selected biasing phrases 116, while the phoneme encoder 126 is configured to generate an encoded phoneme vector 128 from the phoneme data 120 associated with the selected biasing phrases 116. In the example shown, the speech recognition model 300 concatenates the encoded grapheme vector 124 and the encoded phoneme vector 128 into a projection vector 130 input to the contextual biasing attention module 134. Accordingly, the projection vector 130 may correspond to a representation of the encoded grapheme and phoneme vectors 124, 128.
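The following is a minimal PyTorch sketch of this encoder arrangement, assuming toy layer sizes: a bidirectional LSTM audio encoder, single-layer LSTM grapheme and phoneme encoders, and a linear layer that projects the concatenated grapheme and phoneme encodings into one projection vector per biasing phrase. All class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class BiasingEncoders(nn.Module):
    def __init__(self, feat_dim=80, grapheme_vocab=128, phoneme_vocab=64,
                 enc_dim=256, proj_dim=512):
        super().__init__()
        # Audio encoder (first encoder): acoustic frames -> audio encodings.
        self.audio_enc = nn.LSTM(feat_dim, enc_dim, batch_first=True,
                                 bidirectional=True)
        self.g_embed = nn.Embedding(grapheme_vocab, enc_dim)
        self.p_embed = nn.Embedding(phoneme_vocab, enc_dim)
        self.g_enc = nn.LSTM(enc_dim, enc_dim, batch_first=True)
        self.p_enc = nn.LSTM(enc_dim, enc_dim, batch_first=True)
        # Projects the concatenated grapheme/phoneme encodings into a
        # single projection vector per biasing phrase.
        self.proj = nn.Linear(2 * enc_dim, proj_dim)

    def forward(self, feats, graphemes, phonemes):
        # feats: (batch, U, feat_dim); graphemes/phonemes: (N, max_len) ids.
        audio_enc, _ = self.audio_enc(feats)               # (batch, U, 2*enc_dim)
        _, (h_g, _) = self.g_enc(self.g_embed(graphemes))  # h_g: (1, N, enc_dim)
        _, (h_p, _) = self.p_enc(self.p_embed(phonemes))   # h_p: (1, N, enc_dim)
        projection = self.proj(torch.cat([h_g[-1], h_p[-1]], dim=-1))
        return audio_enc, projection                       # projection: (N, proj_dim)
```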

The audio attention module 132 is configured to compute attention weights and generate a weighted audio encoding vector 136 that includes a vector summary of which audio frames the model 300 is currently attending to. For instance, the audio encoder 110 can generate an audio vector 112 for each window or frame of audio data 108 as additional acoustic features 109 are received, and so continues to provide additional audio vectors 112 as more audio is received. In some examples, the audio attention module 132 computes attention weights over the sequence of audio vectors 112 using a decoder audio state 141 from a previous time step and the encoded audio vector 112. The decoder audio state 141 may represent a portion of the state of the decoder 142 when outputting the most recent decoder output, e.g., a portion of the decoder state for the previous time step representing the immediately previous window or frame of audio. The decoder audio state 141 helps inform the audio attention module 132 of the relative importance of the audio vector 112 input to the audio attention module 132 at the current time step. The decoder audio state 141 can be a portion of the output of the decoder 142, such as a designated subset of the outputs of the decoder 142, typically a different set of outputs than used in a decoder context state 143. In some implementations, the state vectors 141, 143 represent non-overlapping portions of the output of the decoder 142. In other implementations, the state vectors 141, 143 include one or more overlapping sections, or both state vectors 141, 143 each represent the whole output of the decoder 142.

At the same time, the contextual biasing attention module 134 is configured to compute attention weights using both of the encoded grapheme and phoneme vectors 124, 128 included in the projection vector 130. The contextual biasing attention module 134 uses these attention weights to compute a weighted sum of only the grapheme embeddings, which is output as a contextual biasing vector 138. More specifically, the contextual biasing attention module 134 processes the projection vectors 130 and a decoder context state 143 from a previous time step to generate the contextual biasing vector 138. The decoder context state 143 includes the state of the decoder 142 at the previous time step with respect to the previous contextual biasing vectors 138. The decoder context state 143 may represent a portion of the output of the decoder 142, such as a designated subset of the outputs of the decoder 142. The decoder context state 143 informs the contextual biasing attention module 134 of the relative importance of the projection vector 130 input to the contextual biasing attention module 134 at the current time step. Thus, as the decoder context state 143 changes, the contextual biasing attention module 134 calculates a different summary or aggregation of the projection vectors 130, thus changing the probabilities that the decoder 142 will indicate for elements of the different biasing phrases 116.

By having the contextual biasing attention module 134 compute only the weighted sum of the grapheme embeddings, the model 300 does not learn an a-priori relationship between spelling and pronunciation, thereby permitting the model to assign an arbitrary sequence of phonemes to any word. However, since the contextual biasing attention module 134 does use the encoded phoneme vector 128, in addition to the encoded grapheme vector 124, to compute the attention weights, the weighted sum of the grapheme embeddings computed from the attention weights contains phoneme information in addition to grapheme information.

Notably, the contextual biasing vector 138 output from the contextual biasing attention module 134 represents contextual biasing of the speech recognition model 300, and includes both grapheme and phoneme information. Accordingly, concatenating the contextual biasing vector 138 with the weighted audio encoding vector 136 into a weighted vector 140 “injects” contextual biasing into the speech recognition model 300. The weighted vector 140 collectively represents the audio, grapheme, and phoneme information. The weighted vector 140 is input to the decoder 142.

In some configurations, the decoder 142 includes a sequence model such as an LSTM network and is configured to extract an output sequence from the weighted vector 140. The output of the decoder 142 is also normalized with a softmax layer 144 to produce a probability distribution over a set of output targets, which are graphemes in the illustrated example. In other examples, the set of output targets in the probability distribution may include words, wordpieces, or phonemes. A beam search process may use the various output vectors from the decoder 142 to produce the transcription 146 for the utterance 104. The decision made about which grapheme represents the audio at the current time step is fed back to the decoder 142 as a speech element output 148 and is used to compute the next output from the decoder 142. In the example shown, the speech element output 148 includes a grapheme output. In other examples, the speech element output 148 includes a word output or a wordpiece output. In yet other examples, the speech element output 148 includes a phoneme output.
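The loop below is a simplified greedy sketch of this feedback arrangement (a real system would use beam search as noted above). The `decode_step` and `initial_state` methods are hypothetical stand-ins for a decoder that runs both attention modules and one LSTM step.

```python
import torch

def greedy_decode(model, audio_enc, bias_proj, sos_id, eos_id, max_len=200):
    y_prev = torch.tensor([sos_id])
    state = model.initial_state()
    output_ids = []
    for _ in range(max_len):
        # One decoder step: attends over audio encodings and biasing
        # projections, then emits a distribution over graphemes.
        logits, state = model.decode_step(y_prev, audio_enc, bias_proj, state)
        y_prev = logits.argmax(dim=-1)  # feedback of the chosen grapheme
        if y_prev.item() == eos_id:
            break
        output_ids.append(y_prev.item())
    return output_ids
```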

In the example of FIG. 1, the context information 114 of the user's 102 location, open apps, and upcoming calendar appointment biases the speech recognition model 300 to the biasing phrase “Bexar” and the pronunciation /ber/. This enables the speech recognition model 300 to correctly transcribe the user's 102 command as “Navigate to Bexar Courthouse.” The user device 106 can then perform an action based on this transcription, such as mapping a route to the courthouse within a navigation app.

FIG. 2 shows a schematic view of an example process 200 for generating a set of one or more biasing phrases 116 relevant to a context of an utterance 104. The process 200 includes identifying a context 202 of the speech utterance 104, and selecting context information 114 from data sources 210 to generate the biasing phrases 116, including both grapheme data 118 and phoneme data 120. Each biasing phrase 116 in the set of one or more biasing phrases includes one or more words, such that the grapheme data 118 includes the graphemes associated with each word of the biasing phrases 116 and the phoneme data 120 includes the phonemes associated with each word of the biasing phrases 116.

In more detail, the process 200 begins when a user speaks an utterance 104. When an ASR system receives audio input from the spoken utterance 104, one or more contexts 202 of the utterance 104 are determined. The context 202 determines the biasing phrases 116 that are selected for use in contextual biasing.

The context 202 can be, for example, the location of the user. The location of a user can be determined by GPS coordinates. For example, the context could be that the user is at home, at work, or at a geographic location near the user's home or work. The user's location could also be identified as a venue such as a restaurant, concert hall, or stadium. If the user is traveling, the context could be that the user is on vacation or traveling for work, perhaps to an unfamiliar or foreign city.

Another example of context 202 is the apps that are open on a user's mobile device. If the user opens the telephone app, the user may be about to make a telephone call to a contact. If the user opens a navigation app, the user may be about to navigate to a destination.

An example of a context 202 can also be the date and time of the spoken utterance. For example, if it is early on a Monday morning, the user may be preparing to drive to work. Thus, the user may speak an utterance 104 that will help him or her prepare for the work day, such as to ask for weather or traffic conditions or to navigate to the office.

A selection module 203 references the context 202 to select information 114 from various data sources 210. The data sources 210 can include, for example, contacts 204, calendar events 206, and previously spoken commands 208. In one example, if the context 202 includes an open telephone app, the selection module 203 may select information 114 from stored contacts 204. If the context 202 includes an open navigation app, the selection module 203 may select information 114 regarding the location of upcoming calendar events. The selection module 203 may also reference previous commands 208. For example, the user may ask to navigate to work every weekday morning. Therefore, an utterance 104 spoken on a Monday morning will likely include words that have recently been spoken by the user related to the user's work location. The context information 114 may also specify the one or more contexts 202 relevant to the spoken utterance 104.
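As a concrete illustration of this selection logic, the sketch below maps simple context signals to candidate biasing phrases. The context keys and data-source shapes are assumptions made for the example, not a definitive implementation of selection module 203.

```python
# Illustrative sketch of a selection module: given simple context
# signals, pick candidate biasing phrases from per-user data sources.
def select_biasing_phrases(context, contacts, calendar_events,
                           previous_commands, max_phrases=64):
    phrases = []
    if "phone" in context.get("open_apps", []):
        phrases += contacts                       # e.g., ["Joan", "John"]
    if "navigation" in context.get("open_apps", []):
        phrases += [e["location"] for e in calendar_events if "location" in e]
    if context.get("time_of_day") == "morning":
        phrases += previous_commands              # recently spoken phrases
    # Deduplicate while preserving order, and cap the set size.
    seen, result = set(), []
    for p in phrases:
        if p not in seen:
            seen.add(p)
            result.append(p)
    return result[:max_phrases]
```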

The information 114 obtained/selected by the selection module 203 is used to create the biasing phrases 116. Biasing phrases 116 include both grapheme data 118 and phoneme data 120. The phoneme data 120 may be generated using a pronunciation lexicon 216.

As an example, a user may leave home most weekday mornings, drive to Bowie, Md., to drop off a child at daycare, then continue to Baltimore, Md., for work. When the user speaks the utterance 104 “Navigate to Bowie, Md.,” the context 202 for the utterance 104 can include the location of the user, which is at home, in Maryland. The context 202 can also include an open navigation app, and the date and time, which may be a Monday morning. The selection module 203 accesses the data sources 210. The calendar events 206 show that the user has a meeting in Baltimore, Md., later that day. The previous commands 208 show that on most weekday mornings, the user asks to navigate to Bowie and then Baltimore. The selection module 203 uses this context information 114 to create the biasing phrases 116.

In this example, the grapheme biasing data 118 may include the words “Baltimore,” “Maryland,” and “Bowie,” as the user has a navigation app open, and these three words are related to likely destinations of the user. The pronunciation lexicon 216 is used to create the phoneme biasing data 120. The user device 106 accesses the pronunciation lexicon 216 and retrieves pronunciations, e.g., sequences of phonemes, which represent common pronunciations of words. A pronunciation lexicon may include variations of pronunciations of words. For example, someone who is from Baltimore may pronounce “Baltimore” differently from someone who is visiting from another state. Likewise, if a traveler is visiting a country where the spoken language is not their native language, the traveler's pronunciation of a place name is likely different from a native speaker's pronunciation.

This process 200 generates the grapheme biasing data 118 and the phoneme biasing data 120. The biasing phrases 116 are then fed to the grapheme encoder 122 and the phoneme encoder 126 for injection into the ASR model 300, as illustrated in FIG. 1.

FIG. 3 illustrates an example architecture of the speech recognition model 300 of the ASR system 100 of FIG. 1 for contextual biasing using graphemes and phonemes. The example architecture for the model 300 depicted in FIG. 3 includes a listen, attend, and spell (LAS) model. The LAS architecture includes three main components: the audio encoder (e.g., first encoder) 110; the audio attention module (e.g., first attention module) 132; and the decoder 142. The audio encoder 110 is configured to receive a time-frequency representation (i.e., acoustic features) 109 of the input speech audio signal 108 and uses a set of neural network layers to map the input 109 to a higher-level feature representation, i.e., the encoded audio vector (e.g., audio encodings) 112. The encoded audio vector 112 output of the audio encoder 110 is passed to the audio attention module 132, which uses the encoded audio vector 112 to learn an alignment between the input acoustic features 109 and predicted graphemes 148. The output 136 of the audio attention module 132 is passed to the decoder 142, which produces a probability distribution over a set of hypothesized graphemes. The components of the LAS model are trained jointly as a single end-to-end neural network.

The example LAS architecture is enhanced to include additional components to enable the model 300 to bias speech recognition results 146, 148 toward biasing phrases 116 relevant to a given context 202. These additional components include the grapheme encoder (e.g., second encoder) 122, the phoneme encoder (e.g., third encoder) 126, and the bias attention module (e.g., second attention module) 134.

The enhanced architecture of the speech recognition model 300 accepts audio, grapheme, and phoneme inputs x_u, g_n, and p_n, where the subscripts correspond to the U audio frames 109 and the N biasing phrases 116. The encoders 110, 122, 126 encode the respective inputs into encodings h^x, h^g, and h^p, which are then fed into the corresponding attention modules 132, 134. The attention modules 132, 134, in turn, output corresponding context attention vectors c_t^x and c_t^g fed to the decoder 142. The decoder 142 outputs d_t^x and d_t^g that are used at the next decoding step to compute the next attention outputs. The output of the example LAS model 300 is a vector y_t for every decoding step t from 1 to T.

In some implementations, the audio encoder 110 includes a bidirectional LSTM configured to receive, as input, audio frames x_1 . . . x_U (e.g., acoustic features 109 such as log-mel features). In one example, the audio encoder 110 includes ten LSTM layers, each having 256 nodes and a hidden size of 1,400. Additionally, the grapheme and phoneme encoders 122, 126 may each include a single-layer LSTM with 512 dimensions. The grapheme encoder 122 and the phoneme encoder 126 may each receive, as input, respective grapheme data 118 and phoneme data 120 associated with N randomly shuffled biasing phrases 116. Accordingly, the number of phoneme inputs (e.g., phoneme data) 120 that convey pronunciation information may be identical to the number of grapheme inputs (e.g., grapheme data) 118, and the corresponding embeddings/encodings 124, 128 output from the encoders 122, 126 are computed independently from one another.

In some examples, the audio attention module 132 includes four heads with 512 coordinates. In the example shown, the audio attention module 132 receives two inputs: the audio encoding 112, h^x, that includes the encoding of the audio frames 109; and the output 141 of the decoder 142 at the previous time step, d^x_{t-1}. The superscript x informs the audio attention module 132 that the output 141 corresponds to a decoder audio state representing a portion of the state of the decoder 142 when outputting the most recent decoder output, e.g., a portion of the decoder state for the previous time step representing the immediately previous window or frame of audio. Thus, the output 141 associated with the decoder audio state helps inform the audio attention module 132 of the relative importance of the audio vector 112, h^x, input to the audio attention module 132 at the current time step. The output 141 corresponding to the decoder audio state can be a portion of the output of the decoder 142, such as a designated subset of the outputs of the decoder 142, typically a different set of outputs than used in the output 143 corresponding to the decoder context state.

On the other hand, the bias attention module 134 is configured to receive, as input, an encoded projection 130 corresponding to the concatenation between the grapheme encoding 124, h^g, output from the grapheme encoder 122 and the phoneme encoding 128, h^p, output from the phoneme encoder 126 based on the grapheme and phoneme inputs 118, 120. A projection layer may concatenate the encodings 124, 128 and output the corresponding encoded projection 130. More specifically, the bias attention module 134 attends, at every decoding step t, to a weighted sum of the encoded projection 130. Here, the bias attention module 134 may use both the grapheme and phoneme encodings 124, 128 to compute the attention weights. Once the attention weights are computed, they are used to compute a weighted sum c_t^g of the grapheme embeddings. Thus, the output 138 of the bias attention module 134 includes only the weighted sum c_t^g of the grapheme encodings 124, h^g.

In more detail, the output 138 of the bias attention module 134 is computed in three steps. In the first step, the bias attention module 134 receives, as input, the grapheme encodings h_i^g, the phoneme encodings h_i^p, and the output 143 from the previous decoding step, d^g_{t-1}. The superscript g informs the bias attention module 134 that the output 143 corresponds to a decoder grapheme state representing a portion of the state of the decoder 142 when outputting the most recent decoder output. In some examples, the bias attention module 134 computes a score u_{ti}^g for every biasing word i and every decoding step t as follows:

$$u_{ti}^{g} = v^{g\top} \tanh\left(W_{h}^{g} h_{i}^{g} + W_{h}^{p} h_{i}^{p} + W_{d}^{g} d_{t}^{g} + b^{g}\right) \tag{1}$$

In the second step, the scores are normalized with a softmax to obtain α_t^g as follows:

$$\alpha_{t}^{g} = \operatorname{softmax}\left(u_{t}^{g}\right) \tag{2}$$

In the third step, the output 138 is computed as a weighted sum of the grapheme embeddings h_i^g:

$$c_{t}^{g} = \sum_{i=1}^{N} \alpha_{ti}^{g}\, h_{i}^{g} \tag{3}$$
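A compact PyTorch sketch of equations (1)-(3) follows, assuming one grapheme encoding and one phoneme encoding per biasing phrase; module and variable names are illustrative, not a definitive implementation.

```python
import torch
import torch.nn as nn

class BiasAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, att_dim):
        super().__init__()
        self.W_g = nn.Linear(enc_dim, att_dim, bias=False)
        self.W_p = nn.Linear(enc_dim, att_dim, bias=False)
        self.W_d = nn.Linear(dec_dim, att_dim, bias=True)  # bias carries b^g
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, h_g, h_p, d_prev):
        # h_g, h_p: (N, enc_dim) per-phrase encodings; d_prev: (dec_dim,).
        scores = self.v(torch.tanh(
            self.W_g(h_g) + self.W_p(h_p) + self.W_d(d_prev)))  # eq. (1)
        alpha = torch.softmax(scores.squeeze(-1), dim=0)         # eq. (2)
        # Weighted sum over grapheme encodings only, per eq. (3).
        c_g = (alpha.unsqueeze(-1) * h_g).sum(dim=0)
        return c_g, alpha
```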

The bias attention output 138, c_t^g, represents an embedding of graphemes. Thus, in some examples, the decoder 142 is configured to operate on graphemes only; therefore, the decoder input c_t^g is in terms of graphemes only. This enables the injection of words for which the pronunciation is difficult to predict from the spelling. With this approach, there is no a-priori relationship between spelling and pronunciation, and an arbitrary sequence of phonemes can be assigned to any word.

However, in configurations where the decoder 142 is configured to produce a probability distribution over a set of output targets other than graphemes, such as wordpieces, the bias attention output 138 may be in terms of the type of output target the decoder 142 is configured to produce. Optionally, the output of the bias attention module 134 may represent embeddings of phonemes, e.g., a weighted sum of phonemes, and the decoder 142 may instead operate on phonemes only to produce a set of output targets associated with phonemes.

In some examples, the decoder 142 includes four LSTM layers, each having a size of 256 with a hidden dimension of 1,024. In the example shown, the decoder 142 receives three inputs. The first input is the previous grapheme output 148, y_{t-1}. When the model is used to make predictions, y_{t-1} is the output of the softmax 144. When the model is training, y_{t-1} can either be a one-hot encoding of the true grapheme, or a sampling of the output and the one-hot encoding of the true grapheme. Another input 136 to the decoder 142, c_t^x, includes a summary of the audio frames 109 the model 300 is currently paying attention to. The third input 138 to the decoder 142 includes the bias attention output c_t^g.

With continued reference to FIG. 3, the output of the example LAS model 300 includes a vector y_t for every decoding step t from 1 to T. In ASR, the vector y_t corresponds to the one-hot encoding of the output symbols (graphemes, space, and end-of-sequence). The model 300 outputs T−1 symbols until it predicts an end-of-sequence symbol at the T-th step.

In some implementations, a training process trains the speech recognition model 300 on multiple batches of training examples. Here, for each batch, a reference transcript is randomly kept with probability P_keep. This allows examples for which there is no matching biasing phrase. For the remaining reference transcripts, n-grams are randomly selected, where n is uniformly sampled from [1:N_order]. The number of selected n-grams is N, where N is uniformly sampled from [1:N_phrases]. Example values are P_keep = 0.5, N_phrases = 1, and N_order = 4. Accordingly, the number of biasing phrases 116 in each batch is variable and not specified at the time of training the model 300. Moreover, one or more biasing phrases 116 applied during training may have different lengths than the other biasing phrases 116. The model 300 can be trained using log-mel acoustic features computed every 10 ms for windows of 25 ms and stacked by groups of 3. An example acoustic feature vector has 80 coordinates.
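The sampling procedure can be sketched as follows, assuming each reference transcript arrives as a list of words. The reading that a kept transcript contributes bias n-grams, while a dropped one (probability 1 − P_keep) contributes none, follows the description above.

```python
import random

def sample_bias_phrases(words, p_keep=0.5, n_phrases=1, n_order=4):
    # With probability 1 - P_keep, this example has no matching bias phrase.
    if random.random() > p_keep or not words:
        return []
    phrases = []
    for _ in range(random.randint(1, n_phrases)):
        n = random.randint(1, min(n_order, len(words)))  # n-gram order
        start = random.randint(0, len(words) - n)
        phrases.append(" ".join(words[start:start + n]))
    return phrases
```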

A no-biasing option can be added, where the biasing graphemes and phonemes are empty. The no-biasing option can be applied by applying a contextual biasing vector 138 that does not include grapheme and phoneme biasing data 118, 120 associated with any biasing phrase 116. This sampling approach ensures that some examples do not have matching words, while others do. A special token is introduced to help the model converge: after each matching bias token, a special character </bias> is added. It introduces a biasing error that can only be corrected by using the biasing phrase.
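A sketch of this token augmentation, assuming bias phrases are matched by simple string search in the reference transcript:

```python
def add_bias_tokens(transcript: str, bias_phrases: list) -> str:
    # Append the special </bias> token after each matching bias phrase.
    for phrase in bias_phrases:
        transcript = transcript.replace(phrase, phrase + " </bias>")
    return transcript

# Example: add_bias_tokens("navigate to bowie", ["bowie"])
# -> "navigate to bowie </bias>"
```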

For training the phonemes, a mix of lexicon lookup and grapheme-to-phoneme (G2P) conversion can be used. If a word is in the lexicon, its corresponding pronunciation can be used. If a word is not in the lexicon, G2P can be used to predict the pronunciation.
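A minimal sketch of this lexicon-first lookup, using the publicly available g2p_en package as one example G2P predictor; the toy lexicon entry is illustrative.

```python
from g2p_en import G2p

g2p = G2p()
lexicon = {"bexar": ["B", "EH1", "R"]}  # toy lexicon entry

def pronounce(word: str) -> list:
    # Use the lexicon pronunciation when available; otherwise fall back
    # to grapheme-to-phoneme prediction.
    return lexicon.get(word.lower(), g2p(word))
```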

Injecting biasing context, including pronunciation knowledge, into a neural network model results in a 14% relative improvement over injecting no biasing context. The contextually biased model retains the advantages of neural network models such as simple, unified training and implicit learning of pronunciations of rare words. At the same time, the contextually biased model incorporates knowledge of rare word pronunciations even if the words have never been present during training. However, if the context contains similarly sounding phrases (e.g., when both “Joan” and “John” are in the user's contact list), disambiguation of the correct phrase remains challenging. Disambiguation can be improved by training an ASR context mechanism with difficult negative examples.

In the above training example, training bias phrases are created by randomly sampling n-grams from a reference transcript and other utterances in the training data. At test time, the bias phrases are fixed in advance and are from the same semantic category (e.g., contact names). In this case, the test task is harder than the training task, since distinguishing between first names such as Joan and John is usually more challenging than distinguishing between random unrelated n-grams.

FIG. 4 illustrates an example method 400 that can be used to close the train-test discrepancy that results from training with randomly sampled n-grams. The method 400 detects proper nouns from a reference transcript instead of sampling random n-grams, and augments them with phonetically similar, or “fuzzy,” alternatives to create sets of bias phrases for training.

Method 400 focuses on proper nouns (e.g., unique entities such as names of people and places) in reference transcripts, and uses phonetically similar phrases as negative examples, encouraging the neural network model to learn more discriminative representations. This approach can be applied to a neural network contextual ASR model that jointly learns to transcribe and select the correct context items in order to improve WER by up to 53.1%.

In ASR, two phrases can be very similar to each other phonetically (i.e., in the way they sound), but be unquestionably different (e.g., “call Joan” and “call John”). For a neural network ASR model, the learned representations for these names might be very similar, leading the model to predict the wrong one. This problem is especially challenging for ASR models presented with rare and difficult-to-spell words, as the model might not observe these words at all during training, and will thus fail to spell them correctly at test time.

Training using negative examples is a way to teach the model to distinguish these rare and difficult-to-spell phrases from phonetically similar ones. Using this approach, the model learns more robust representations that help it to perform better at test time. The method for training using negative examples includes: (i) detecting difficult-to-transcribe and rare words in the input utterance as the target of focus; and (ii) training harder on these words by providing the model with difficult negative examples.

For detection of the phrases to focus on ((i)), proper nouns (also tagged as “NNP”) are the general category of phrases that are rare and usually more difficult to transcribe, while being relatively easy to detect. For training harder on these phrases ((ii)), phonetically similar alternative phrases are extracted and fed to the model as negative examples.

This approach can be thought of as data augmentation for speech. While data augmentation is usually used in machine learning for generating mutated positive examples, this approach is used for generating mutated negative examples. This approach is applied to the contextually biased LAS model to improve WER.

In the example of FIG. 4, the utterance “Call Joan's mobile” is provided to an ASR model. When the example utterance 410 is presented, the method 400 detects proper nouns 415 from the reference transcript 405.

In this example, the word “Joan” 420 is selected as a bias phrase. “Joan” 420 is easily identified as a proper noun because it is a name of a person. To identify proper nouns, the reference transcript is analyzed with a part-of-speech tagger.
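For illustration, proper-noun detection of this kind can be done with an off-the-shelf tagger such as NLTK's, which labels proper nouns with the Penn Treebank tag “NNP”; this is one possible tooling choice, not necessarily the one used here.

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def find_proper_nouns(transcript: str) -> list:
    tokens = nltk.word_tokenize(transcript)
    return [word for word, tag in nltk.pos_tag(tokens) if tag == "NNP"]

# e.g., find_proper_nouns("Call Joan's mobile") might return ["Joan"]
```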

The method 400 then adds phonetically similar, or fuzzy, phrases as alternatives 425. In this case, the words “John” and “Jean” 430 are selected as alternative bias phrases. A word n-gram is identified as a fuzzy alternative of a second word n-gram if both are phonetically similar and co-occur often in different decoding beams of the same utterances in training data. The process of identifying fuzzy alternatives is described as follows.

First, a fuzzy inventory is built that stores, for each n-gram, a set of fuzzy n-gram alternatives. The fuzzy inventory is built in an unsupervised manner, by using an external conventional model to decode a large corpus of utterances. Each decoded utterance results in a set of hypotheses. All of the co-occurrences of word n-gram pairs that appear in different hypotheses where the rest of the hypotheses are identical are counted, and each n-gram pair is scored according to that count. For instance, “John Lemon” will get a high score for the target n-gram “John Lennon” if these can often be found in different hypotheses of the same utterances.
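The counting step can be sketched as below, simplified to hypothesis pairs that differ in exactly one word; the actual procedure scores word n-gram pairs more generally.

```python
from collections import Counter
from itertools import combinations

def build_fuzzy_inventory(decoded_utterances):
    # decoded_utterances: list of hypothesis lists, each hypothesis a
    # tuple of words, all hypotheses decoded from the same utterance.
    counts = Counter()
    for hypotheses in decoded_utterances:
        for hyp_a, hyp_b in combinations(hypotheses, 2):
            if len(hyp_a) != len(hyp_b):
                continue
            diffs = [(a, b) for a, b in zip(hyp_a, hyp_b) if a != b]
            if len(diffs) == 1:  # identical except for one word
                a, b = diffs[0]
                counts[(a, b)] += 1
                counts[(b, a)] += 1
    return counts  # e.g., counts[("john", "joan")] -> co-occurrence score
```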

Next, the fuzzy inventory is used during contextually biased LAS training. Specifically, given a word n-gram (representing a bias phrase in our case), fuzzy alternatives are selected from the fuzzy inventory and are sorted by the co-occurrence score. The top-ranking candidates are filtered by keeping only those that are phonetically similar to the target n-gram, where phonetic similarity is measured using the Hixon metric. The selection of fuzzy bias alternatives is done as part of the data preparation phase.

The set of extracted proper nouns with their fuzzy alternatives are fed as bias phrases 435 to the grapheme and phoneme encoders. In this way, the bias phrases 435 are used to train the ASR model on the example in the reference transcript 405. This approach allows the model to train on a difficult task of distinguishing between phonetically similar names.

From each training example, at most three proper nouns are selected at random, and three fuzzy alternatives are added for each proper noun. Other random proper nouns are added from the rest of the training data, resulting in up to 64 bias phrases per example.
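Putting the pieces together, a per-example bias set might be assembled as sketched below, where `fuzzy_alternatives` maps a proper noun to its ranked fuzzy alternatives and `distractor_pool` holds proper nouns drawn from other training examples; all names are hypothetical.

```python
import random

def assemble_bias_set(proper_nouns, fuzzy_alternatives, distractor_pool,
                      max_phrases=64):
    # Select at most three proper nouns from this example at random.
    selected = random.sample(proper_nouns, min(3, len(proper_nouns)))
    bias = []
    for noun in selected:
        bias.append(noun)
        # Add up to three fuzzy alternatives per proper noun.
        bias.extend(fuzzy_alternatives.get(noun, [])[:3])
    # Pad with random proper nouns from the rest of the training data.
    while len(bias) < max_phrases and distractor_pool:
        bias.append(distractor_pool.pop())
    return bias[:max_phrases]
```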

When presented with phonetically similar phrases at test time, the fuzzy model captures the subtle phonetic differences better than the non-fuzzy model. This is expressed both by a more accurate prediction and by more attention on the bias phrase that actually appears in the reference transcript rather than on its fuzzy alternatives. Thus, training using fuzzy distractors makes the model better at discriminating phonetically similar phrases. Additionally, the fuzzy model attends more sharply, and its attention distribution is much cleaner than that of the non-fuzzy model, which includes incorrect phrases in its attention.

Training contextualized neural speech recognition models with difficult negative examples results in improved WER. The core idea is to detect and focus on proper nouns (“NNP”) in the reference transcript, and present the model with phonetically similar (“fuzzy”) phrases as their negative examples. When applied to a speech biasing task, the approach improves WER by up to 53.1%.

FIG. 5 is a flowchart of an example arrangement of operations for a method 500 of biasing a speech recognition model 300 toward one or more biasing phrases 116 relevant to a context 202 of a spoken utterance 104. The method 500 may be executed on data processing hardware 610 (FIG. 6) residing on a user device 106 associated with a user 102 that spoke the utterance 104. Optionally, the data processing hardware 610 may reside on a remote device (e.g., a server of a cloud-based computing environment) in communication with the user device 106, e.g., over a network. At operation 502, the method 500 includes receiving, at the data processing hardware 610, audio data 108 encoding the utterance 104, and at operation 504, the method 500 includes obtaining, by the data processing hardware 610, a set of one or more biasing phrases 116 corresponding to the context 202 of the utterance 104. Here, each biasing phrase 116 in the set of one or more biasing phrases 116 includes one or more words. As used herein, the one or more “words” of each biasing phrase 116 may include proper nouns such as, without limitation, people names in a contact list 204 of the user, city names, music artist names, album/movie (or other media content type) titles, etc. The context 202 may be determined based on at least one of a current date and/or time of the utterance, one or more applications open on the user device 106, or a location of the user 102.

At operation 506, the method 500 includes processing, by the data processing hardware 610, using the speech recognition model 300, acoustic features 109 derived from the audio data and grapheme and phoneme data 118, 120 derived from the set of one or more biasing phrases 116 to generate an output 148 of the speech recognition model 300. The output 148 may include a grapheme output for each frame of the acoustic features 109. At operation 508, the method 500 includes determining, by the data processing hardware 610, a transcription 146 for the utterance 104 based on the output 148 of the speech recognition model 300. The transcription 146 may be displayed on a graphical user interface of the user device 106 and/or one or more other devices in communication with the data processing hardware 610. In additional examples, the transcription 146 is provided as a command to a search engine or software application to perform an operation.

A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an “application,” an “app,” or a “program.” Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.

The non-transitory memory may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by a computing device. The non-transitory memory may be volatile and/or non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), and phase change memory (PCM), as well as disks or tapes.

FIG. 6 is a schematic view of an example computing device 600 that may be used to implement the systems and methods described in this document. The computing device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown here, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the inventions described and/or claimed in this document.

The computing device 600 includes a processor 610, memory 620, a storage device 630, a high-speed interface/controller 640 connecting to the memory 620 and high-speed expansion ports 650, and a low-speed interface/controller 660 connecting to a low-speed bus 670 and the storage device 630. Each of the components 610, 620, 630, 640, 650, and 660 is interconnected using various busses and may be mounted on a common motherboard or in other manners as appropriate. The processor 610 can process instructions for execution within the computing device 600, including instructions stored in the memory 620 or on the storage device 630, to display graphical information for a graphical user interface (GUI) on an external input/output device, such as a display 680 coupled to the high-speed interface 640. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 600 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 620 stores information non-transitorily within the computing device 600. The memory 620 may be a computer-readable medium, a volatile memory unit(s), or a non-volatile memory unit(s). The non-transitory memory 620 may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by the computing device 600. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), and phase change memory (PCM), as well as disks or tapes.

The storage device 630 is capable of providing mass storage for the computing device 600. In some implementations, the storage device 630 is a computer-readable medium. In various different implementations, the storage device 630 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device; a flash memory or other similar solid-state memory device; or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 620, the storage device 630, or memory on the processor 610.

The high-speed controller 640 manages bandwidth-intensive operations for the computing device 600, while the low-speed controller 660 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In some implementations, the high-speed controller 640 is coupled to the memory 620, the display 680 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 650, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 660 is coupled to the storage device 630 and a low-speed expansion port 690. The low-speed expansion port 690, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

The computing device 600 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 600a, multiple times in a group of such servers 600a, as a laptop computer 600b, or as part of a rack server system 600c.

Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer-readable medium, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

The processes and logic flows described in this specification can be performed by one or more programmable processors, also referred to as data processing hardware, executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, by way of example, semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, or a touch screen for displaying information to the user, and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.

What is claimed is:
 1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving audio data encoding one or more words of an utterance; obtaining grapheme data and phoneme data derived from the one or more words of the utterance, the one or more words comprising a proper noun; generating, using a grapheme encoder configured to receive the grapheme data, grapheme encodings; generating, using a phoneme encoder configured to receive the phoneme data, phoneme encodings; generating, using an attention module configured to receive a representation of the grapheme encodings output from the grapheme encoder and the phoneme encodings output from the phoneme encoder, attention outputs; and processing, using a decoder, the attention outputs generated by the attention module to determine likelihoods of speech elements.
 2. The computer-implemented method of claim 1, wherein the operations further comprise generating a transcription of the utterance based on the likelihoods of speech elements.
 3. The computer-implemented method of claim 1, wherein the grapheme encoder and the phoneme encoder each comprise neural networks.
 4. The computer-implemented method of claim 1, wherein: the grapheme encoder is configured to generate a corresponding grapheme encoding for a particular word; the phoneme encoder is configured to generate a corresponding phoneme encoding for the particular word; and the attention module is configured to encode a corresponding second attention output that comprises a corresponding contextual biasing vector for the particular word based on the corresponding grapheme and phoneme encodings for the particular word.
 5. The computer-implemented method of claim 1, wherein the representation of the grapheme encodings output from the grapheme encoder and the phoneme encodings output from the phoneme encoder comprises a concatenation between the grapheme encodings and the phoneme encodings.
 6. The computer-implemented method of claim 5, wherein the concatenation between the grapheme encodings and the phoneme encodings is represented by a projection vector input to the attention module.
 7. The computer-implemented method of claim 1, wherein the grapheme encoder, the phoneme encoder, the attention module, and the decoder are trained jointly to predict a sequence of graphemes.
 8. The computer-implemented method of claim 1, wherein the speech elements comprise graphemes.
 9. The computer-implemented method of claim 1, wherein the speech elements comprise words or wordpieces.
 10. The computer-implemented method of claim 1, wherein the operations further comprise determining a context of the utterance based on at least one of: a location of a user that spoke the utterance; one or more applications open on a user device associated with a user that spoke the utterance; or a current date and/or time of the utterance.
 11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving audio data encoding one or more words of an utterance; obtaining grapheme data and phoneme data derived from the one or more words of the utterance, the one or more words comprising a proper noun; generating, using a grapheme encoder configured to receive the grapheme data, grapheme encodings; generating, using a phoneme encoder configured to receive the phoneme data, phoneme encodings; generating, using an attention module configured to receive a representation of the grapheme encodings output from the grapheme encoder and the phoneme encodings output from the phoneme encoder, attention outputs; and processing, using a decoder, the attention outputs generated by the attention module to determine likelihoods of speech elements.
 12. The system of claim 11, wherein the operations further comprise generating a transcription of the utterance based on the likelihoods of speech elements.
 13. The system of claim 11, wherein the grapheme encoder and the phoneme encoder each comprise neural networks.
 14. The system of claim 11, wherein: the grapheme encoder is configured to generate a corresponding grapheme encoding for a particular word; the phoneme encoder is configured to generate a corresponding phoneme encoding for the particular word; and the attention module is configured to encode a corresponding second attention output that comprises a corresponding contextual biasing vector for the particular word based on the corresponding grapheme and phoneme encodings for the particular word.
 15. The system of claim 11, wherein the representation of the grapheme encodings output from the grapheme encoder and the phoneme encodings output from the phoneme encoder comprises a concatenation between the grapheme encodings and the phoneme encodings.
 16. The system of claim 15, wherein the concatenation between the grapheme encodings and the phoneme encodings is represented by a projection vector input to the attention module.
 17. The system of claim 11, wherein the grapheme encoder, the phoneme encoder, the attention module, and the decoder are trained jointly to predict a sequence of graphemes.
 18. The system of claim 11, wherein the speech elements comprise graphemes.
 19. The system of claim 11, wherein the speech elements comprise words or wordpieces.
 20. The system of claim 11, wherein the operations further comprise determining a context of the utterance based on at least one of: a location of a user that spoke the utterance; one or more applications open on a user device associated with a user that spoke the utterance; or a current date and/or time of the utterance.